A discriminative view of MRF pre-processing algorithms
While Markov Random Fields (MRFs) are widely used in computer vision, they
present a quite challenging inference problem. MRF inference can be accelerated
by pre-processing techniques like Dead End Elimination (DEE) or QPBO-based
approaches which compute the optimal labeling of a subset of variables. These
techniques are guaranteed to never wrongly label a variable but they often
leave a large number of variables unlabeled. We address this shortcoming by
interpreting pre-processing as a classification problem, which allows us to
trade off false positives (i.e., giving a variable an incorrect label) versus
false negatives (i.e., failing to label a variable). We describe an efficient
discriminative rule that finds optimal solutions for a subset of variables. Our
technique provides both per-instance and worst-case guarantees concerning the
quality of the solution. Empirical studies were conducted over several
benchmark datasets. We obtain a speedup factor of 2 to 12 over expansion moves
without pre-processing, and on difficult non-submodular energy functions our
method produces slightly lower energy.
Comment: ICCV 201
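The classical DEE rule mentioned above can be sketched with Goldstein's simple singles criterion: a label a at variable i is provably non-optimal if some witness label b lowers the energy no matter how the neighbors are labeled. The data layout below is an illustrative assumption, not the paper's code.

```python
import numpy as np

def dee_eliminate(unary, pairwise, neighbors):
    """Goldstein's simple dead-end elimination (DEE) criterion.

    unary:     dict var -> 1-D array of label costs
    pairwise:  dict (i, j) -> 2-D cost array (rows: labels of i), for i < j
    neighbors: dict var -> list of neighboring vars
    Returns:   dict var -> set of labels proven non-optimal
    """
    def pair(i, j):
        # pairwise costs with rows indexed by labels of i
        return pairwise[(i, j)] if (i, j) in pairwise else pairwise[(j, i)].T

    dead = {i: set() for i in unary}
    for i, costs in unary.items():
        for a in range(len(costs)):
            for b in range(len(costs)):
                if a == b:
                    continue
                # worst-case energy gain of switching a -> b,
                # minimized over all labelings of the neighbors
                gap = costs[a] - costs[b] + sum(
                    np.min(pair(i, j)[a] - pair(i, j)[b])
                    for j in neighbors[i])
                if gap > 0:  # switching to b never increases the energy
                    dead[i].add(a)
                    break
    return dead
```

Because the rule only ever removes labels that cannot appear in any optimal solution, it never produces false positives, which is exactly the conservativeness the paper's discriminative view relaxes.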
Structured learning of sum-of-submodular higher order energy functions
Submodular functions can be exactly minimized in polynomial time, and the
special case that graph cuts solve with max flow \cite{KZ:PAMI04} has had
significant impact in computer vision
\cite{BVZ:PAMI01,Kwatra:SIGGRAPH03,Rother:GrabCut04}. In this paper we address
the important class of sum-of-submodular (SoS) functions
\cite{Arora:ECCV12,Kolmogorov:DAM12}, which can be efficiently minimized via a
variant of max flow called submodular flow \cite{Edmonds:ADM77}. SoS functions
can naturally express higher order priors involving, e.g., local image patches;
however, it is difficult to fully exploit their expressive power because they
have so many parameters. Rather than trying to formulate existing higher order
priors as an SoS function, we take a discriminative learning approach,
effectively searching the space of SoS functions for a higher order prior that
performs well on our training set. We adopt a structural SVM approach
\cite{Joachims/etal/09a,Tsochantaridis/etal/04} and formulate the training
problem in terms of quadratic programming; as a result we can efficiently
search the space of SoS priors via an extended cutting-plane algorithm. We also
show how the state-of-the-art max flow method for vision problems
\cite{Goldberg:ESA11} can be modified to efficiently solve the submodular flow
problem. Experimental comparisons are made against the OpenCV implementation of
the GrabCut interactive segmentation technique \cite{Rother:GrabCut04}, which
uses hand-tuned parameters instead of machine learning. On a standard dataset
\cite{Gulshan:CVPR10} our method learns higher order priors with hundreds of
parameter values, and produces significantly better segmentations. While our
focus is on binary labeling problems, we show that our techniques can be
naturally generalized to handle more than two labels.
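An SoS energy is a sum of clique terms, E(x) = sum_C f_C(x_C), where each f_C is submodular on its clique. The sketch below evaluates such an energy over binary labels and checks the defining inequality f(S ∪ T) + f(S ∩ T) ≤ f(S) + f(T); the data layout is an illustrative assumption.

```python
from itertools import combinations

def sos_energy(labeling, cliques):
    """Evaluate a sum-of-submodular energy E(x) = sum_C f_C(x_C).

    labeling: dict var -> 0/1
    cliques:  list of (vars, f) where f maps a frozenset (the clique's
              vars labeled 1) to a cost.
    """
    total = 0.0
    for vars_, f in cliques:
        ones = frozenset(v for v in vars_ if labeling[v] == 1)
        total += f[ones]
    return total

def is_submodular(vars_, f):
    """Check f(S | T) + f(S & T) <= f(S) + f(T) for all subsets S, T."""
    subsets = [frozenset(c) for r in range(len(vars_) + 1)
               for c in combinations(vars_, r)]
    return all(f[s | t] + f[s & t] <= f[s] + f[t] + 1e-9
               for s in subsets for t in subsets)
```

The parameter explosion the abstract mentions is visible here: a clique of k variables contributes up to 2^k table entries, which is why learning them from data is attractive.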
Dimensions of Motion: Monocular Prediction through Flow Subspaces
We introduce a way to learn to estimate a scene representation from a single
image by predicting a low-dimensional subspace of optical flow for each
training example, which encompasses the variety of possible camera and object
movement. Supervision is provided by a novel loss which measures the distance
between this predicted flow subspace and an observed optical flow. This
provides a new approach to learning scene representation tasks, such as
monocular depth prediction or instance segmentation, in an unsupervised fashion
using in-the-wild input videos without requiring camera poses, intrinsics, or
an explicit multi-view stereo step. We evaluate our method in multiple
settings, including an indoor depth prediction task where it achieves
comparable performance to recent methods trained with more supervision.
Comment: Project page at https://dimensions-of-motion.github.io
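The loss described above, the distance between an observed flow and a predicted low-dimensional flow subspace, can be sketched as a least-squares projection residual. This is an illustrative reading of the abstract; the array shapes and names are assumptions.

```python
import numpy as np

def subspace_flow_loss(basis, flow):
    """Residual of projecting an observed flow onto a predicted subspace.

    basis: (K, H, W, 2) array of K predicted basis flow fields
    flow:  (H, W, 2) observed optical flow
    Returns the mean squared residual after the best linear fit.
    """
    K = basis.shape[0]
    B = basis.reshape(K, -1).T               # (H*W*2, K) design matrix
    y = flow.reshape(-1)                     # flattened observed flow
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)  # best coefficients
    residual = y - B @ coef
    return float(np.mean(residual ** 2))
```

The loss is zero whenever the observed flow lies in the span of the predicted basis, so the network is supervised without ever needing camera poses or intrinsics, matching the abstract's claim.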
Event-based photo rediscovery and resurfacing
Social networks, online photo-sharing services, messaging services, etc. include features that provide a user with reminders of photos that may be of interest. For example, such services may resurface photos taken on the same day in the past, e.g., a year ago. Resurfacing past photos allows the user to relive memories, and viewing resurfaced photos has become a popular online activity. However, some periodic events do not occur on exactly the same day each year. For example, an annual football game may occur on different days across years (e.g., the first Monday of October, which falls on a different date each year), birthday celebrations may be moved to the nearest weekend, religious holidays may follow the lunar calendar, etc. This disclosure describes techniques to detect and resurface photos that depict similar periodic, e.g., annual, events that have taken place on possibly differing days. The similar annual events need not take place on the same day of the year, so long as they take place within a certain time period near a particular date of interest.
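The time-window matching described above can be sketched as follows; the helper name and window size are hypothetical, chosen only to illustrate the idea.

```python
from datetime import date

def candidate_resurface_dates(photo_dates, today, window_days=10):
    """Return past-year photo dates whose anniversary falls within a
    window around today's date (hypothetical illustrative helper)."""
    hits = []
    for d in photo_dates:
        if d.year >= today.year:
            continue  # only consider photos from past years
        try:
            anniversary = d.replace(year=today.year)
        except ValueError:
            continue  # e.g., Feb 29 in a non-leap year
        if abs((anniversary - today).days) <= window_days:
            hits.append(d)
    return hits
```

A real system would additionally cluster photos by visual or semantic similarity so that only photos depicting the same kind of event are matched, as the disclosure describes.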
Automated image retargeting at scale using a generative adversarial network
Given the wide variety of device screens, approaches that rely on manual creation of multiple forms of an image for each display situation do not scale. Therefore, automated techniques are required to process and resize images for the different devices on which an image may be displayed. The goal of automated retargeting is to change the aspect ratio of the original image while preserving the semantic and visual meaning of its content. Currently, typical retargeting pipelines involve a multitude of operations, use a variety of manually tuned heuristics, and do not work effectively for all images. This disclosure describes a generative adversarial network (GAN) used to retarget an input image to a different aspect ratio. The described approach involves scaling each pixel row within an image by a factor between 0 and 1.
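Per-pixel scale factors in (0, 1] can drive a simple content-aware warp: columns with small factors are compressed, columns with factors near 1 are preserved. The nearest-neighbor sketch below is a simplified stand-in for the described GAN-predicted scaling, with names and layout as assumptions.

```python
import numpy as np

def retarget_width(image, scales, target_width):
    """Warp an image to a new width using per-column scale factors.

    image:  (H, W) or (H, W, C) array
    scales: (W,) factors in (0, 1]; smaller means more compression
    """
    W = image.shape[1]
    # cumulative output position of each source column, rescaled to target
    pos = np.cumsum(scales) / np.sum(scales) * target_width
    # inverse map: for each output column center, find its source column
    src = np.searchsorted(pos, np.arange(target_width) + 0.5)
    src = np.clip(src, 0, W - 1)
    return image[:, src]
```

With uniform factors this reduces to plain nearest-neighbor resizing; non-uniform factors squeeze low-importance regions first, which is the behavior the GAN is trained to produce.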
Test-Time Distribution Normalization for Contrastively Learned Vision-language Models
Advances in the field of vision-language contrastive learning have made it
possible for many downstream applications to be carried out efficiently and
accurately by simply taking the dot product between image and text
representations. One of the most representative approaches proposed recently
known as CLIP has garnered widespread adoption due to its effectiveness. CLIP
is trained with an InfoNCE loss that takes into account both positive and
negative samples to help learn a much more robust representation space. This
paper reveals that the common downstream practice of taking a dot product is
only a zeroth-order approximation of the optimization goal, resulting in a loss
of information during test-time. Intuitively, since the model has been
optimized based on the InfoNCE loss, test-time procedures should also be in
alignment. The question lies in how one can retrieve any semblance of negative
samples information during inference in a computationally efficient way. To
this end, we propose Distribution Normalization (DN), where we approximate the
mean representation of a batch of test samples and use such a mean to represent
what would be analogous to negative samples in the InfoNCE loss. DN requires no
retraining or fine-tuning and can be effortlessly applied during inference.
Extensive experiments on a wide variety of downstream tasks exhibit a clear
advantage of DN over the dot product on top of other existing test-time
augmentation methods.
Comment: Accepted to NeurIPS 2023, project webpage: https://fengyuli-dev.github.io/dn-website
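One plausible reading of the DN idea, subtracting each modality's test-batch mean embedding before the dot product so the mean stands in for averaged negative samples, can be sketched as below. This is a hypothetical illustration of the abstract, not the paper's exact formula.

```python
import numpy as np

def dn_similarity(img_emb, txt_emb, img_bank, txt_bank):
    """Distribution-normalized similarity (illustrative sketch).

    img_emb, txt_emb:   embeddings of the pair being scored
    img_bank, txt_bank: (N, D) embeddings of a batch of test samples,
                        used only to estimate each modality's mean
    """
    def unit(v):
        # L2-normalize, as CLIP-style models do before the dot product
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    img_mu = unit(img_bank).mean(axis=0)   # mean image representation
    txt_mu = unit(txt_bank).mean(axis=0)   # mean text representation
    return (unit(img_emb) - img_mu) @ (unit(txt_emb) - txt_mu)
```

Note that only the means are estimated at test time; no retraining or fine-tuning is involved, consistent with the abstract's claim.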